1. Introduction

There are many problems in mathematics that can be solved by a search over all 
possibilities. In combinatorial group theory every classical decision problem has its 
natural "search" variation. For example, the search version of the Word Problem
for a group given by a presentation $\langle X \mid R \rangle$ would ask, for a word $u$ from the
normal closure of the set of relators $R$, to produce an expression of $u$ as a product of
conjugates of elements from $R$. The search Conjugacy Problem would require one to
produce a conjugating element, and the search Membership Problem would ask for
an expression of a given $u$ from a subgroup $\langle v_1, \ldots, v_k \rangle$ as a product of the generators
$v_1, \ldots, v_k$ (and their inverses). In a free group, finding a solution of a given
consistent equation or determining whether or not a given element has minimal
length in its automorphic orbit (Whitehead minimality problem) are typical search
problems. All these problems are recursively enumerable, so, in principle, the total
search would produce an answer sooner or later. Unfortunately, in practice the 
total search could be extremely inefficient: there are not any recursive bounds on 
time complexity of the search word, conjugacy, or membership problems, and there 
is no algorithm known to find solutions of equations or determine minimality of a 
word in a free group in time better than exponential time in the size of the problem. 
In this paper, we offer some insights to speed up the solving of computationally hard
problems that arise in certain problem populations of free groups. We will show that
with some intelligent reasoning, a significant fraction of problems in the population
can be solved much more quickly than in exponential time.

The classical way algorithms are formulated in search problems is to design 
a way to perform the associated tree search and try to improve the solution time 
by fancier data structures, more efficient code, and by heuristics. Our point of 
view is to improve how we solve these problems using the experience of having 
solved many such problems from the given population. Thus our problem domain 
is given by a population of problems. We will sample many problem instances 
from the population and solve each instance. We will then examine the statistical 
characteristics in the tree search of how each problem instance was solved and then 
incorporate this statistically derived knowledge from the experience of solving these
problems into a smarter tree search. There are situations in which we may not have 
to do any tree searching at all and our pattern recognition techniques will do all 
the work. 

The use of this kind of experience may not improve the worst case complexity. 
But it can dramatically improve the computational complexity on a large fraction
of problems from the population. For example, we may discover that 99 percent of 
the problems can be solved in linear time, 0.9 percent in quadratic time, and the 
remainder 0.1 percent in exponential time. 

There are two dimensions to using this pattern recognition technology. The
first dimension is the representation of any problem instance by a fixed-dimensional
feature vector which captures the information in the problem instance. The second
dimension is the use of standard tools in pattern recognition that, given the feature
vector, designate the most probable class or the most probable next successful step
in the tree search.

In this paper we briefly discuss several typical pattern recognition techniques
and demonstrate some of their applications to the Whitehead minimality problem.
For notation and relevant results on the Whitehead method we refer to the paper [4].

2. General remarks on Pattern Recognition tasks 

One of the main applications of Pattern Recognition (PR) techniques is classification
of a variety of given objects into categories. Usually classification algorithms,
or classifiers, try to find a set of measurements (properties, characteristics) of objects,
called features, which gives a descriptive representation for the objects.

Generally, pattern recognition techniques can be divided in two principal types: 

• supervised learning; 

• unsupervised learning (clustering).

In supervised learning the decision algorithms are "trained" on a prearranged
dataset, called the training dataset, in which each pattern is labelled with its true class
label. If such information is not available, one can use clustering. In this case clustering
algorithms try to find clusters, or "natural groupings", of the given objects.
In this paper we use supervised learning pattern recognition algorithms.

Every pattern recognition task of the supervised learning type has to face all
of the following issues:

(1) Obtaining the data. The training datasets can be obtained from the 
real world or generated by reliable procedures, which provide independent 
and representative sampled data. 

(2) Feature extraction. The task of feature extraction is problem specific 
and requires knowledge of the problem domain. If such knowledge is 
limited then one may consider as many features as possible and then try 
to extract the most "significant" ones using statistical methods. 

(3) Model selection. The model is the theoretical basis for the classifier. A 
particular choice of the model determines the basic design of the classifier 
(though there might be some variations in the implementation). Model 
selection is one of the most active areas of research in pattern recognition. 
Usually model selection is closely related to the feature extraction. In 
practice, one may try several standard models starting with the simplest 
ones or more economic ones. 

(4) Evaluation. Evaluation of the performance of a particular PR system 
is an important component of the pattern recognition task. It answers 
the question whether or not the given system performs as required. To 
evaluate a system one can use various accuracy measures, for example, 
percentage of correct answers. To get reliable estimates other sets of data 
that are independent from the training sets must be used. Such sets are 
called test datasets. 

Typically we view a PR system as consisting of components (1)-(4).

(5) Analysis of the system. Careful analysis of the performance of a particular
classifier may improve feature extraction and model selection. For
example, one can look for an optimal set of features or for a more effective
model. Moreover, through analysis of the most significant (or insignificant)
features one may gain new knowledge about the original objects.

3. Feature Vectors 

A feature vector is just a vector of properties of the object of interest, in
our case of the mathematical object of interest. The properties are chosen to
be ones thought to be relevant to the problem. We will illustrate the selection of
features by the Whitehead minimality problem for a free group $F = F(X)$ with basis
$X$.

Let $w$ be a reduced word in the alphabet $X^{\pm 1}$. Below we describe features
of $w$ which characterize a certain placement of specific words from $F(X)$ in $w$.

Let $K \in \mathbb{N}$ be a natural number, $v_1, \ldots, v_K \in F(X)$ be words from $F(X)$, and
$U_1, \ldots, U_{K+1} \subseteq F(X)$ be subsets of $F(X)$. Denote by

$$C(w, U_1 v_1 \cdots v_K U_{K+1})$$

the number of subwords of the type

$$u_1 v_1 u_2 \cdots v_K u_{K+1},$$

where $u_j \in U_j$, which occur in $w$. For fixed $K$, $v_1, \ldots, v_K$, $U_1, \ldots, U_{K+1}$ we obtain
a counting function

$$w \in F \mapsto C(w, U_1 v_1 \cdots v_K U_{K+1}) \in \mathbb{N}. \tag{1}$$

The normalized value

$$\frac{1}{|w|} C(w, U_1 v_1 \cdots v_K U_{K+1})$$

is called a feature of $w$ and the function

$$w \in F \mapsto \frac{1}{|w|} C(w, U_1 v_1 \cdots v_K U_{K+1}) \in \mathbb{R}$$

is called a feature function on $F$. Usually we omit $U_i$ in our notation if $U_i = \emptyset$.
If $C = (C_1(w), \ldots, C_N(w))$ is a sequence of counting functions like (1), one can
associate with $w$ a vector of real numbers

$$f_C(w) = \frac{1}{|w|} \langle C_1(w), \ldots, C_N(w) \rangle \in \mathbb{R}^N,$$

which is called a feature vector. Every choice of the sequence $C$ gives a vector $f_C(w)$
which reflects the structure of $w$.

For example, if $a \in X$ then $C(w, a)$ counts the number of occurrences of
the letter $a$ in $w$. The feature vector (where for simplicity we assume that the
components are written in some order which we do not specify)

$$f_0(w) = \frac{1}{|w|} \langle C(w, a) \mid a \in X^{\pm 1} \rangle$$

shows the frequencies of occurrences of letters from $X^{\pm 1}$ in $w$. The feature vector

$$f_1(w) = \frac{1}{|w|} \langle C(w, v) \mid |v| = 2 \rangle$$

shows the numbers of occurrences of words of length two in $w$ relative to the length
of $w$.
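
The two feature vectors $f_0$ and $f_1$ can be computed directly from the definition. The following is a minimal sketch, not the authors' implementation. It assumes a word is stored as a Python string in which a lowercase letter is a generator and the corresponding uppercase letter is its inverse; this encoding and the function names are illustrative choices.

```python
from collections import Counter


def signed_alphabet(generators):
    """X^{+-1}: the generators followed by their inverses (uppercase)."""
    return list(generators) + [g.upper() for g in generators]


def letter_features(w, generators="ab"):
    """f_0(w): occurrences of each letter of X^{+-1}, divided by |w|."""
    counts = Counter(w)
    return [counts[x] / len(w) for x in signed_alphabet(generators)]


def pair_features(w, generators="ab"):
    """f_1(w): occurrences of each reduced word of length two, divided by |w|."""
    alphabet = signed_alphabet(generators)
    # A word xy of length two is reduced when y is not the inverse of x.
    pairs = [x + y for x in alphabet for y in alphabet if y != x.swapcase()]
    counts = Counter(w[i:i + 2] for i in range(len(w) - 1))
    return [counts[p] / len(w) for p in pairs]
```

For a free group of rank 2 this gives 4 components for $f_0$ and 12 components for $f_1$.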

To visualize some structures described by the counting functions above we
associate with a given word $w \in F(X)$ a weighted labelled directed graph $\Gamma(w)$.
Put $V(\Gamma(w)) = X^{\pm 1}$. For given $x, y \in X^{\pm 1}$ and $v \in F(X)$ we connect the vertex $x$
to the vertex $y$ by an edge with label $v$ and weight $C(w, xvy)$. Now, with every
edge from $x$ to $y$ with label $v$ one can associate a counting function $C(w, xvy)$,
and vice versa. It follows that every subgraph $\Gamma$ of $\Gamma(w)$ gives rise to a particular
set of counting functions $C_\Gamma$ of the type $C(w, xvy)$, and conversely, every set $C$ of
counting functions of the type $C(w, xvy)$ determines a subgraph $\Gamma_C$ of $\Gamma(w)$. For
instance, the feature mapping $f_1$ corresponds to the subgraph $\Gamma_1(w)$ of $\Gamma(w)$ which
is in a sense a directed version of the so-called Whitehead graph of $w$.

Let $U_n$ be the set of all words in $F$ of length $n$. Let $W_n$ be the set of all
words in $F$ of length $n$ or less. Other relevant features can be defined as
follows. Each corresponds to a subgraph of the graph $\Gamma(w)$:

$$f_2(w) = \frac{1}{|w|} \langle C(w, x_1 U_1 x_2) \mid x_1, x_2 \in X^{\pm 1} \rangle;$$

$$f_3(w) = \frac{1}{|w|} \langle C(w, x_1 U_2 x_2) \mid x_1, x_2 \in X^{\pm 1} \rangle;$$

$$f_4(w) = \frac{1}{|w|} \langle C(w, x_1 U_3 x_2) \mid x_1, x_2 \in X^{\pm 1} \rangle;$$




$$f_5(w) = \frac{1}{|w|} \langle C(w, x_1 W_1 x_2) \mid x_1, x_2 \in X^{\pm 1} \rangle;$$

$$f_6(w) = \frac{1}{|w|} \langle C(w, x_1 W_{\ldots} x_2) \mid x_1, x_2 \in X^{\pm 1} \rangle.$$
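
Counting functions of the type $C(w, x_1 U_n x_2)$ and $C(w, x_1 W_n x_2)$ reduce to counting pairs of positions in $w$. The sketch below illustrates this under the assumption that $w$ is a reduced word stored as a string (lowercase for generators, uppercase for inverses); the function names are hypothetical.

```python
def gap_counts(w, n):
    """C(w, x1 U_n x2) for every ordered pair (x1, x2) that occurs in w.

    Because w is reduced, any n consecutive letters of w form a reduced word
    of length n, so an occurrence is just a pair of positions i and i + n + 1.
    """
    counts = {}
    for i in range(len(w) - n - 1):
        pair = (w[i], w[i + n + 1])
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def gap_feature(w, n, alphabet):
    """Feature vector built from C(w, x1 U_n x2), normalized by |w|."""
    counts = gap_counts(w, n)
    return [counts.get((x, y), 0) / len(w) for x in alphabet for y in alphabet]


def bounded_gap_feature(w, n, alphabet):
    """Feature vector built from C(w, x1 W_n x2): gaps of length 0, 1, ..., n."""
    total = [0.0] * (len(alphabet) ** 2)
    for m in range(n + 1):
        total = [s + v for s, v in zip(total, gap_feature(w, m, alphabet))]
    return total
```

Each vector has $|X^{\pm 1}|^2$ components, one per ordered pair of end letters.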

4. Pattern Recognition Tools and Models 

There are a variety of pattern recognition tools that are useful for determining 
a way of making a distinction given a set of feature vectors from one class and a 
set of feature vectors from another class. For each given class of objects, we sample 
objects from the class and construct the set of corresponding feature vectors. The 
pattern recognition technology provides a way of determining a best or near best 
boundary in the feature space that distinguishes the one class from the other. In 
this section we will review some of the basic techniques. The reader interested in a 
fuller discussion may consult general references P] . [3 ] [5 ] . p] . 

4.1. Principal Components. There are occasions when the feature vectors 
coming from a class either all lie in a small dimensional flat or most of them lie in 
a small dimensional flat. Principal component analysis can determine this. 

Let $x_1, \ldots, x_N$ be the set of feature vectors sampled from a given class. Define
their sample mean $\mu$ by

$$\mu = \frac{1}{N} \sum_{n=1}^{N} x_n.$$

Define their sample covariance $C$ by

$$C = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)'.$$

Let $t_1, \ldots, t_N$ be the eigenvectors of $C$ with corresponding eigenvalues
$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N$. Should the feature vectors indeed lie in a small dimensional flat, then there
will be a $K < N$ such that $\lambda_k = 0$, $k = K + 1, \ldots, N$.

In this case, feature vectors $x$ coming from objects that are in the class can be
recognized by testing whether or not

$$\|T(x - \mu)\| = 0,$$

where $T$ is an $(N - K) \times N$ matrix whose rows are the eigenvectors $t_{K+1}', \ldots, t_N'$.
Those in the class will have $\|T(x - \mu)\| = 0$. A value $\|T(x - \mu)\| > 0$ is a sure
indication that $x$ comes from an object out of the class, but there may be some
objects out of the class for which $\|T(x - \mu)\| = 0$.

In the case of two classes, we form the matrix $T_1$ from the zero eigenvalue
eigenvectors of the covariance matrix from the class one feature vectors and the
matrix $T_2$ from the zero eigenvalue eigenvectors of the covariance matrix from the
class two feature vectors.

Now if $\|T_1(x - \mu_1)\| = 0$ and $\|T_2(x - \mu_2)\| > 0$, we assign vector $x$ to class
one. If $\|T_1(x - \mu_1)\| > 0$ and $\|T_2(x - \mu_2)\| = 0$, we assign vector $x$ to class two.
If $\|T_1(x - \mu_1)\| = 0$ and $\|T_2(x - \mu_2)\| = 0$, feature vector $x$ comes from an object
that is both class 1 and class 2. If $\|T_1(x - \mu_1)\| > 0$ and $\|T_2(x - \mu_2)\| > 0$, feature
vector $x$ comes from an object that is neither class 1 nor class 2.

4.2. Classifying by Distance. Let $T_1$ and $T_2$ be defined as before. We form
the discriminant function

$$f(x) = \|T_1(x - \mu_1)\| - \|T_2(x - \mu_2)\|,$$

which measures the difference between the distance from the feature vector $x$ to the flat
associated with class 1 and its distance to the flat associated with class 2. Since a small
value of $f(x)$ means that $x$ is close to the class 1 flat, the decision rule is to assign vector
$x$ to class 1 if $f(x) < \theta$, otherwise assign to class 2. Here, after the discriminant
function is defined we determine the value of $\theta$ that minimizes the error.

Classifying by distance can also be done with respect to the class means. Here
the discriminant function is defined by

$$f(x) = (x - \mu_1)' C_1^{-1} (x - \mu_1) - (x - \mu_2)' C_2^{-1} (x - \mu_2).$$

As before, the decision rule is to assign vector $x$ to class 1 if $f(x) < \theta$, otherwise
assign to class 2. Here also after the discriminant function is defined we determine
the value of $\theta$ that minimizes the error.
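
The flat test of Section 4.1 and the first discriminant of this section can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than a definitive implementation: eigenvalues below a small tolerance `tol` are treated as zero (the text assumes exact zeros), and the function names are hypothetical.

```python
import numpy as np


def fit_flat(samples, tol=1e-10):
    """Return (mu, T) for one class.

    mu is the sample mean; the rows of T are the eigenvectors of the sample
    covariance whose eigenvalues are (numerically) zero, so T @ (x - mu) is
    defined and vanishes for every x lying in the flat of the class.
    """
    x = np.asarray(samples, dtype=float)
    mu = x.mean(axis=0)
    cov = (x - mu).T @ (x - mu) / len(x)
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric matrix, ascending order
    T = eigvecs[:, eigvals <= tol].T
    return mu, T


def flat_distance(x, mu, T):
    """||T (x - mu)||: distance from x to the flat through mu."""
    return float(np.linalg.norm(T @ (np.asarray(x, dtype=float) - mu)))


def discriminant(x, flat1, flat2):
    """f(x): distance to the class 1 flat minus distance to the class 2 flat."""
    return flat_distance(x, *flat1) - flat_distance(x, *flat2)
```

A negative value of `discriminant` means that `x` lies nearer to the class 1 flat than to the class 2 flat.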

4.3. Linear Classifiers. Classifying may be done by a linear decision rule. 
Here the discriminant function is given by 

$$f(x) = v'x,$$

where the vector $v$ is the weight vector and is the normal to the hyperplane separating
the feature space into two parts.

If $f(x) < \theta$ the decision rule is to assign the vector $x$ to class 1, otherwise to
class 2. There are a variety of ways to construct the weight vector $v$. One method is
by regression. Another is named the Fisher linear discriminant. A third is called
the support vector machine approach.

4.3.1. Regression Classifier. In the regression classifier, we form a matrix $A$
whose rows are the feature vectors. We form a vector $b$ whose $k$th component is
0 if the $k$th feature vector comes from class one and 1 if the
$k$th feature vector comes from class two. We determine the weight vector $v$ as that
vector that minimizes $\|Av - b\|$. The minimizing vector $v$ is given by the normal
equation

$$v = (A'A)^{-1} A'b.$$

The discriminant function is defined by

$$f(x) = v'x.$$

We assign a vector to class one if $f(x) < \theta$ and to class two otherwise. The threshold $\theta$ is chosen
to minimize the error of the assignment.
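
A minimal NumPy sketch of the regression classifier follows. Two points are assumptions of the sketch and not part of the text: the least squares problem is solved with `numpy.linalg.lstsq` instead of forming $(A'A)^{-1}$ explicitly, and $\theta$ is searched only among midpoints of consecutive sorted training scores.

```python
import numpy as np


def fit_regression_classifier(class_one, class_two):
    """Return (v, theta) for the rule: class one if v'x < theta, else class two."""
    A = np.vstack([class_one, class_two]).astype(float)
    b = np.concatenate([np.zeros(len(class_one)), np.ones(len(class_two))])
    # Least squares solution of min ||A v - b||; equals (A'A)^{-1} A'b
    # whenever A'A is invertible.
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    scores = A @ v
    ordered = np.sort(scores)
    candidates = (ordered[:-1] + ordered[1:]) / 2
    # Training error of each candidate threshold.
    errors = [np.sum((scores < t) != (b == 0)) for t in candidates]
    theta = float(candidates[int(np.argmin(errors))])
    return v, theta


def classify(x, v, theta):
    """Return 1 or 2 according to the linear decision rule."""
    return 1 if float(np.dot(v, x)) < theta else 2
```

In the usage below a constant first component is added to each feature vector so that the hyperplane need not pass through the origin; this is a common convention, not something the text prescribes.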

4.3.2. Fisher Linear Discriminant. Fisher's linear discriminant function is obtained
by maximizing Fisher's discriminant ratio, which, as described below, is
the ratio of the projected between-class scatter to the projected within-class scatter.

Let $v$ be the unknown weight vector. Let $\mu_1$ and $\mu_2$ be the class one and
two sample means and let $C_1$ and $C_2$ be the class one and two sample covariance
matrices. Let $N_1$ be the number of feature vectors in class one and let $N_2$ be the
number of feature vectors in class two. Define the overall mean $\mu$ by

$$\mu = P_1 \mu_1 + P_2 \mu_2,$$

where $P_1 = N_1/(N_1 + N_2)$ and $P_2 = N_2/(N_1 + N_2)$. Then the between-class scatter
matrix $S_b$ is given by

$$S_b = \sum_{i=1}^{2} P_i (\mu_i - \mu)(\mu_i - \mu)' = P_1 P_2 (\mu_1 - \mu_2)(\mu_1 - \mu_2)'.$$

Define $S_w$ to be the average class conditional scatter matrix, then

$$S_w = \sum_{i=1}^{2} P_i C_i.$$

Finally, if we let $S$ designate the scatter matrix of the mixture distribution,

$$S = \frac{1}{N_1 + N_2} \sum_{k} (x_k - \mu)(x_k - \mu)',$$

then

$$S = S_w + S_b.$$

In the one dimensional projected space one can easily show that the projected
between-class scatter $s_b$ and the projected within-class scatter $s_w$ are expressed as

$$s_b = v'S_b v, \qquad s_w = v'S_w v.$$

Then the Fisher discriminant ratio is defined as

$$F(v) = \frac{s_b}{s_w} = \frac{v'S_b v}{v'S_w v}.$$

The optimum direction $v$ can be found by taking the derivative of $F(v)$ with
respect to $v$ and setting it to zero:

$$\nabla F(v) = (v'S_w v)^{-2} \left( 2 S_b v \, v'S_w v - 2 v'S_b v \, S_w v \right) = 0.$$

From this equation it follows that

$$v'S_b v \, S_w v = S_b v \, v'S_w v.$$

If we divide both sides by the quadratic term $v'S_b v$, then

$$S_w v = \frac{v'S_w v}{v'S_b v} S_b v = \lambda S_b v = \lambda P_1 P_2 (\mu_1 - \mu_2)(\mu_1 - \mu_2)'v = \lambda k (\mu_1 - \mu_2),$$

where $\lambda$ and $k$ are some scalar values defined as

$$\lambda = \frac{v'S_w v}{v'S_b v}, \qquad k = P_1 P_2 (\mu_1 - \mu_2)'v.$$

Thus we have the weighting vector $v$ as

$$v = K S_w^{-1} (\mu_1 - \mu_2),$$

where $K = \lambda k$ is a multiplicative constant.

The discriminant function is defined by $f(x) = v'x$. The vector $x$ is assigned to
class one if $f(x) > \theta$ and to class two otherwise. The threshold $\theta$ is set to a value
that minimizes the error of the class assignment.
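
The closed form $v = K S_w^{-1}(\mu_1 - \mu_2)$ translates into a few lines of NumPy. The sketch below takes $K = 1$, uses covariances normalized by the class sizes, and assumes $S_w$ is invertible; these choices and the function name are assumptions made for illustration.

```python
import numpy as np


def fisher_direction(class_one, class_two):
    """Return v = S_w^{-1} (mu_1 - mu_2), i.e. the Fisher direction with K = 1."""
    x1 = np.asarray(class_one, dtype=float)
    x2 = np.asarray(class_two, dtype=float)
    n1, n2 = len(x1), len(x2)
    p1, p2 = n1 / (n1 + n2), n2 / (n1 + n2)
    mu1, mu2 = x1.mean(axis=0), x2.mean(axis=0)
    c1 = (x1 - mu1).T @ (x1 - mu1) / n1  # class one sample covariance
    c2 = (x2 - mu2).T @ (x2 - mu2) / n2  # class two sample covariance
    s_w = p1 * c1 + p2 * c2              # average class conditional scatter
    # Solve S_w v = mu_1 - mu_2 instead of inverting S_w explicitly.
    return np.linalg.solve(s_w, mu1 - mu2)
```

With this sign convention class one vectors tend to have the larger value of $v'x$, in agreement with the rule that assigns $x$ to class one when $f(x) > \theta$.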

4.3.3. Support Vector. Let $z_1, \ldots, z_N$ be the set of $N$ training vectors with
corresponding labels $y_1, \ldots, y_N$, a label being $+1$ for class one and $-1$ for class two.
Consider a hyperplane ribbon $R$ that can separate the training vectors of class one
from the training vectors of class two. We represent $R$ by

$$R = \{x \mid -1 < w'x < 1\}.$$

The support vector machine approach seeks to find the widest ribbon so that $R$
separates the vectors of class one from class two.

The distance of the hyperplane $H = \{x \mid w'x = 1\}$ from the origin is $1/\|w\|$.
Therefore the ribbon $R$ has width $2/\|w\|$. To maximize this width is equivalent
to minimizing $\|w\|$. This minimization must be done under the constraint that
$y_k w'z_k \ge 1$, $k = 1, \ldots, N$. Define the matrix $A$ by

$$A = \begin{pmatrix} y_1 z_1' \\ y_2 z_2' \\ \vdots \\ y_N z_N' \end{pmatrix}.$$

Define the vector $b$ to be a vector of $N$ components all of which have value 1. The
support vector approach determines the vector $w$ by minimizing $w'w$ subject to
the constraint $Aw \ge b$. This can be solved by standard quadratic programming
methods.

4.4. Quantizing. Let $f$ be a discriminant function. We evaluate $f(x)$ over
all the sampled vectors x from class one and from class two to determine the range. 
We divide the range in a fixed number M of quantizing intervals. The simplest 
way is called equal interval quantizing. Here the range is divided up into M equal 
intervals. In each interval the number of sampled vectors coming from class one 
and coming from class two is determined. The interval is labelled by the class of 
the majority of the vectors in it. 

A vector x having discriminant value f(x) which falls into the mth quantizing 
interval is assigned to the class that labels the quantizing interval. 

Another simple quantizing scheme is to divide the range into intervals,
each of which contains the same number of sampled discriminant values. This is
called equal probability quantizing.

A more complex scheme is to divide the discriminant range into M intervals in 
such a way that the classification error is minimized. 
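
Equal interval quantizing can be sketched in plain Python as follows. The text does not say how to label an interval that is empty or tied; in this sketch such intervals are labelled class one, which is an arbitrary assumption, as are the function names.

```python
def fit_equal_interval_quantizer(values_one, values_two, m):
    """Label each of m equal intervals of the discriminant range by majority.

    values_one and values_two are lists of discriminant values f(x) of the
    sampled vectors of class one and class two. Returns (index, labels):
    index(value) gives the interval number, labels[i] is 1 or 2.
    """
    lo = min(values_one + values_two)
    hi = max(values_one + values_two)
    width = (hi - lo) / m

    def index(value):
        if width == 0:
            return 0
        # Values outside the training range fall into the first or last interval.
        return min(max(int((value - lo) / width), 0), m - 1)

    votes = [[0, 0] for _ in range(m)]
    for value in values_one:
        votes[index(value)][0] += 1
    for value in values_two:
        votes[index(value)][1] += 1
    labels = [1 if one >= two else 2 for one, two in votes]
    return index, labels


def classify(value, index, labels):
    """Assign the class that labels the interval containing the value."""
    return labels[index(value)]
```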

4.5. Classification Trees. The type of classification tree discussed here is a 
binary tree with a simple discriminant function; thus every nonterminal node has 
exactly two children [J. During classification, if the node's discriminant function 
is less than a threshold, the left child is taken; if it is greater than the threshold, 
the right child is taken. This section describes the design process of the binary tree 
classifier using a simple discriminant function. There are two methods of expanding 
a nonterminal node according to the selection of a decision rule for the node. We 
show how to use an entropy purity function to decide what the threshold value 
should be, and we discuss the relationship of the purity function to the $\chi^2$ test
statistic. We discuss the criteria for deciding when to stop expanding a node and
for assigning a class.

Let $x_1, \ldots, x_N$ be the set of vectors in the training set. Associated with each
vector is a class label. Let $M$ be the number of classes. Let

$$X^n = \{x_k^n \mid k = 1, \ldots, N^n\}$$

be the subset of $N^n$ training vectors associated with node $n$. Let $N_c^n$ be the number
of training vectors for class $c$ in node $n$. Since $N^n$ is the total number of training
samples in node $n$, we must have $N^n = \sum_{c=1}^{M} N_c^n$. The decision rule selected for
node $n$ is that discriminant function having the greatest purity, a quality we will
precisely define later.

Now we define how the decision rule works at node $n$. Consider the feature
vector $x_k^n$. If the discriminant function $f(x_k^n)$ is less than or equal to the threshold,
then $x_k^n$ is assigned to class $\Omega_{\mathrm{LEFT}}^n$; otherwise it is assigned to class $\Omega_{\mathrm{RIGHT}}^n$. An
assignment to $\Omega_{\mathrm{LEFT}}^n$ means that the feature vector descends to the left child node.
An assignment to $\Omega_{\mathrm{RIGHT}}^n$ means that the feature vector descends to the right child
node.

Given a discriminant function $f$, we sort the feature vectors in the set $X^n$ in
ascending order according to their discriminant function value. Without loss
of generality we assume that the feature vectors are sorted in such a way that
$f(x_k^n) \le f(x_{k+1}^n)$ for $k = 1, \ldots, N^n - 1$. Let $\omega_k^n$ be the true class associated with
the measurement vector $x_k^n$. Then the set of candidate thresholds $T^n$ is defined by

$$T^n = \left\{ \frac{f(x_k^n) + f(x_{k+1}^n)}{2} \;\middle|\; \omega_{k+1}^n \ne \omega_k^n \right\}.$$

For each possible threshold value $t$, each feature vector $x_k^n$ is classified by using
the decision rule specified above. We count the number of samples $n_{Lc}^t$ assigned to
$\Omega_{\mathrm{LEFT}}^n$ whose true class is $c$, and we count the number of samples $n_{Rc}^t$ assigned to
$\Omega_{\mathrm{RIGHT}}^n$ whose true class is $c$:

$$n_{Lc}^t = \#\{k \mid f(x_k^n) \le t \text{ and } \omega_k^n = c\},$$

$$n_{Rc}^t = \#\{k \mid f(x_k^n) > t \text{ and } \omega_k^n = c\}.$$

Let $n_L^t$ be the total number of samples assigned to $\Omega_{\mathrm{LEFT}}^n$ and $n_R^t$ be the total
number of samples assigned to $\Omega_{\mathrm{RIGHT}}^n$; that is,

$$n_L^t = \sum_{c=1}^{M} n_{Lc}^t, \qquad n_R^t = \sum_{c=1}^{M} n_{Rc}^t.$$

We define the purity $PR_n^t$ of such an assignment made by node $n$ to be

$$PR_n^t = \sum_{c=1}^{M} \left( n_{Lc}^t \ln p_{Lc}^t + n_{Rc}^t \ln p_{Rc}^t \right),$$

where

$$p_{Lc}^t = \frac{n_{Lc}^t}{n_L^t}, \qquad p_{Rc}^t = \frac{n_{Rc}^t}{n_R^t}.$$

The discriminant threshold selected is the threshold $t$ that maximizes the purity
value $PR_n^t$. The purity is such that it gives maximum value when the classes of
the training vectors are completely separable. For example, consider a nonterminal
node having $m$ units in each of three classes in the training sample. If the selected
decision rule separates the training samples such that the LEFT child contains all
feature vectors in one class and the RIGHT child contains all the feature vectors
in the other two classes, the purity is $2(m \ln \tfrac{1}{2}) = -2m \ln 2$. In the worst case,
when both the LEFT and the RIGHT children contain the same number of feature
vectors for each class, the purity is $3(\tfrac{m}{2} \ln \tfrac{1}{3}) + 3(\tfrac{m}{2} \ln \tfrac{1}{3}) = -3m \ln 3$. Thus we
can easily see that the purity value of the former case, where the training samples
are completely separable, is greater than the purity value of the latter case, where
the training samples are not separable.
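
The purity of a split depends only on the per-class counts in the two children, so the two values in the example can be checked numerically. The sketch below is illustrative; it adopts the usual convention that a class with zero count contributes nothing (that is, $0 \ln 0 = 0$), which the text leaves implicit.

```python
import math


def purity(left_counts, right_counts):
    """Entropy purity of a split.

    left_counts[c] and right_counts[c] are the numbers of training vectors of
    class c sent to the LEFT and RIGHT child. Each child contributes the sum
    over classes of n_c * ln(n_c / n), where n is the size of the child.
    """
    total = 0.0
    for counts in (left_counts, right_counts):
        n = sum(counts)
        for n_c in counts:
            if n_c > 0:  # convention: 0 * ln 0 = 0
                total += n_c * math.log(n_c / n)
    return total


m = 10
separated = purity([m, 0, 0], [0, m, m])                        # -2 m ln 2
mixed = purity([m / 2, m / 2, m / 2], [m / 2, m / 2, m / 2])    # -3 m ln 3
```

The purity is never positive, and it equals zero exactly when each child contains vectors of a single class.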

The maximization of the purity can also be explained in terms of the $\chi^2$ test
of goodness of fit. If a decision node is effective, the distribution of classes for
the children nodes will be significantly different from each other. A statistical test
of significance can be used to verify this. One test statistic that measures the
significance of the difference of the distributions is defined by

$$\chi^2 = \sum_{c=1}^{M} \left( n_{Lc}^t \ln p_{Lc}^t + n_{Rc}^t \ln p_{Rc}^t - N_c^n \ln \frac{N_c^n}{N^n} \right).$$

It has a $\chi^2$ distribution with $M - 1$ degrees of freedom. Comparing this equation
with $PR_n^t$, we find that the $\chi^2$ value is just the sum of the purity $PR_n^t$ and some
constant value.

Now we discuss the problem of when to stop the node expanding process and 
how to assign a class to the terminal node. First, it is not reasonable to generate 
a decision tree that has more terminal nodes than the total number of training 
samples. Using this consideration as a starting point, we set the maximum level
of the decision tree to be $\log_2 N^1 - 1$, which makes the number of terminal nodes
less than $N^1/2$, where $N^1$ is the number of training samples in node 1, the root node.
Next, if the $\chi^2$ value is small, the distributions of classes for the children nodes are
not significantly different from each other, and the parent node need not be further
divided. Finally, when the number $N^n$ of units at node $n$ becomes small, the $\chi^2$
test cannot give a reliable result. Therefore we stop expanding node $n$ when $N^n$ is
less than some lower limit. If one of these conditions is detected, then the node n 
becomes a terminal node. 

When a node becomes terminal, an assignment of a label to the node is made. 
Each terminal node is assigned that class label that is the majority of the class 
labels of the training vectors in the node. 

An alternative decision tree construction procedure uses the probability of misclassification
in place of entropy. In this procedure the decision rule selected for the
nonterminal node is the one that yields the minimum probability of misclassification
of the resulting assignment. To describe the termination condition of a node
expansion, we first define type I and type II errors as follows: Let type I error be
the probability that a unit whose true class is in $\Omega_{\mathrm{LEFT}}$ is classified as $\Omega_{\mathrm{RIGHT}}$,
and type II error be the probability that a unit whose true class is in $\Omega_{\mathrm{RIGHT}}$ is
classified as $\Omega_{\mathrm{LEFT}}$. Then, if the sample space is completely separable, we would
get zero for both type I and type II errors. Since this is not always the case, we
control these errors by considering only those thresholds in the process of threshold
selection that give type I error less than $\epsilon_I$ and type II error less than $\epsilon_{II}$, where $\epsilon_I$
and $\epsilon_{II}$ are values determined before we start constructing the decision tree. Next,
in the process of expanding a nonterminal node, if we cannot find a decision rule
that gives type I error less than $\epsilon_I$ and type II error less than $\epsilon_{II}$, which means
that the sample space is not separable at the $\epsilon_I$ and $\epsilon_{II}$ level, we stop expanding this
nonterminal node. This process of decision tree construction is repeated until there
is no nonterminal node left or the level of the decision tree reaches the maximum
level. Assignment of a class to a terminal node is done in the same way as in
the previous procedure.

As just described, there are two groups of classes at each nonterminal node in 
the binary decision tree. For the purpose of discussion, we denote all the classes in 
one group as $\Omega_{\mathrm{LEFT}}$ and all the classes in the other group as $\Omega_{\mathrm{RIGHT}}$. The job of
the decision rules discussed here is to separate the left class $\Omega_{\mathrm{LEFT}}$ from the right
class $\Omega_{\mathrm{RIGHT}}$. We will employ the same notational conventions used previously.
The superscript n denoting the node number will be dropped from the expression 
if it is clear from the context that we are dealing with one particular node n. 

The simplest form for a discriminant function to take is a comparison of one
measurement component to a threshold. This is called a threshold decision rule. If
the selected measurement component is less than or equal to the threshold value,
then we assign class $\Omega_{\mathrm{LEFT}}$ to the unit $u_k$; otherwise we assign class $\Omega_{\mathrm{RIGHT}}$ to it.
This decision rule requires a feature vector component index and a threshold. Each
feature vector component is selected in turn and the set of threshold candidates $T$
of that component is computed. For each threshold in the set $T$, all vectors in the
training set $X^n$ at node $n$ are classified into either class $\Omega_{\mathrm{LEFT}}$ or class $\Omega_{\mathrm{RIGHT}}$
according to their value of the selected feature component. The number of feature
vectors for each class assigned to class $\Omega_{\mathrm{LEFT}}$ and to class $\Omega_{\mathrm{RIGHT}}$ is counted,
and the entropy purity is computed from the resulting classification. A threshold
is selected from the set of threshold candidates $T$ such that, when the set $X^n$ is
classified with that threshold, a maximum purity in the assignment results. This
process is repeated for all possible feature components, and the component and
threshold that yield an assignment with the maximum purity are selected.
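
The search for the best threshold decision rule at a node can be sketched as below. As a simplification, the candidate thresholds here are the midpoints between all consecutive distinct values of a component, which is an assumption of this sketch; the function names are hypothetical.

```python
import math


def purity(left_counts, right_counts):
    """Entropy purity of a split, with the convention 0 * ln 0 = 0."""
    total = 0.0
    for counts in (left_counts, right_counts):
        n = sum(counts)
        total += sum(n_c * math.log(n_c / n) for n_c in counts if n_c > 0)
    return total


def best_threshold_rule(vectors, labels):
    """Return (purity, component index, threshold) of the purest split.

    A vector x goes LEFT if x[j] <= t and RIGHT otherwise. Returns None when
    no component takes two distinct values, so that no split is possible.
    """
    classes = sorted(set(labels))
    best = None
    for j in range(len(vectors[0])):
        values = sorted(set(x[j] for x in vectors))
        for low, high in zip(values, values[1:]):
            t = (low + high) / 2
            left = [sum(1 for x, c in zip(vectors, labels) if x[j] <= t and c == k)
                    for k in classes]
            right = [sum(1 for x, c in zip(vectors, labels) if x[j] > t and c == k)
                     for k in classes]
            score = purity(left, right)
            if best is None or score > best[0]:
                best = (score, j, t)
    return best
```

Applying `best_threshold_rule` recursively to the LEFT and RIGHT subsets, together with the stopping criteria described earlier, gives one way to grow the tree.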

4.6. Clustering. All the previous methods required that the feature vectors 
be generated for each of the objects to be discriminated and an associated class 
label be associated with each of the feature vectors. In this section we discuss 
a method to automatically determine the natural classes, called clusters, to be 
associated with each feature vector. Feature vectors with the same cluster label 
are more similar to each other and less similar to feature vectors of a different 
cluster label. Clustering can be used to help generate hypotheses about natural 
distinctions between mathematical objects before these distinctions themselves are 
known. 

The most widely used clustering scheme is called K-means. It is an iterative
method. It begins with a set of $K$ cluster centers $\mu_1^0, \ldots, \mu_K^0$. Initially these are
chosen at random. At iteration $t$ each feature vector is assigned to the cluster center
to which it is closest. This forms index sets $S_1^t, \ldots, S_K^t$:

$$S_k^t = \{ n \mid \|x_n - \mu_k^t\| \le \|x_n - \mu_m^t\|, \ m = 1, \ldots, K \}.$$

Then each cluster center is redefined as the mean of the feature vectors assigned
to the cluster:

$$\mu_k^{t+1} = \frac{1}{\# S_k^t} \sum_{n \in S_k^t} x_n.$$

Each iteration reduces the criterion function $J_t$:

$$J_t = \sum_{k=1}^{K} \sum_{n \in S_k^t} \|x_n - \mu_k^t\|^2.$$

As this criterion function is bounded below by zero, the iterations must converge.
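
The K-means iteration can be sketched in plain Python as follows. Several details are assumptions of the sketch: the initial centers are drawn from the data points with a seeded generator, ties are broken by the lowest cluster index, and an empty cluster keeps its previous center.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    return [sum(coords) / len(points) for coords in zip(*points)]


def k_means(points, k, iterations=100, seed=0):
    """Return (centers, clusters) after iterating until the centers stop moving."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its cluster.
        new_centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

Convergence of the criterion does not guarantee a global optimum, so in practice the procedure is often restarted from several random initializations.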

5. Recognizing Whitehead Minimal Words in Free Groups 

In this section we return to the Whitehead minimal word problem. We first 
discuss a pattern recognition system based on linear regression, then we show some 
applications of clustering. 

5.1. Linear regression classifier. Data sets. To train a classifier, we must
have a training set. To test the classifier we must have an independent test set.

A "random" element $w$ of $F = F(X)$ can be produced as follows. Choose randomly
a number $l$ (the length of $w$), and a random sequence $y_1, \ldots, y_l$ of elements
$y_i \in X^{\pm 1}$ such that $y_i \ne y_{i+1}^{-1}$, where $y_1$ is chosen randomly and uniformly from
$X^{\pm 1}$, and $y_{i+1}$ is chosen randomly and uniformly from the set $X^{\pm 1} \setminus \{y_i^{-1}\}$. Similarly,
one can pseudo-randomly generate cyclically reduced words in $F$, i.e., words
$w = y_1 \ldots y_l$ where $y_1 \ne y_l^{-1}$.
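
The generation procedure just described can be sketched as follows. Words are stored as strings with lowercase letters for generators and uppercase letters for their inverses, which is an encoding chosen for this illustration. The text only says that cyclically reduced words are generated "similarly"; the rejection loop used here is one possible reading, not necessarily the authors' procedure.

```python
import random


def inverse(letter):
    """Inverse of a letter: 'a' <-> 'A'."""
    return letter.swapcase()


def random_reduced_word(length, generators="ab", rng=random):
    """Random reduced word: each new letter avoids the inverse of the previous one."""
    alphabet = list(generators) + [g.upper() for g in generators]
    word = [rng.choice(alphabet)]
    while len(word) < length:
        forbidden = inverse(word[-1])
        word.append(rng.choice([x for x in alphabet if x != forbidden]))
    return "".join(word)


def random_cyclically_reduced_word(length, generators="ab", rng=random):
    """Draw reduced words until the first letter is not the inverse of the last."""
    while True:
        w = random_reduced_word(length, generators, rng)
        if w[0] != inverse(w[-1]):
            return w
```

Passing a seeded `random.Random` instance as `rng` makes the generated datasets reproducible.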

To generate the training data set we used the following procedure. For each
positive integer $l = 1, \ldots, 1000$ we generate randomly and uniformly 10 cyclically
reduced words from $F(X)$ of length $l$. Denote the resulting set by $W$. Then using the
deterministic Whitehead algorithm one can effectively construct the corresponding
set of minimal elements

$$W_{\min} = \{w_{\min} \mid w \in W\}.$$

With probability 0.5 we substitute each $v \in W_{\min}$ with the word $v^t$, where $t$ is a
randomly and uniformly chosen Whitehead automorphism such that $|v^t| > |v|$ (if
$|v^t| = |v|$ we choose another automorphism $t$, and so on). Now, the resulting set
$D$ is a set of pseudo-randomly generated cyclically reduced words representing the
classes of minimal and non-minimal elements in approximately equal proportions.
We choose $D$ as the training set.

One remark is in order here. It seems that the class of non-minimal elements in
$D$ is not quite representative, since every one of its elements $w$ has Whitehead
complexity 1, i.e., there exists a single Whitehead automorphism which reduces $w$
to $w_{\min}$ (see [3] for details on Whitehead complexity). However, our experiments
showed that the set $D$ is a sufficiently good training dataset which is much easier to
generate than a set with uniformly distributed Whitehead complexity of elements.
A possible mathematical explanation of this phenomenon is mentioned in [4].

To test and evaluate the pattern recognition methodology we generate several
test datasets of different types:

• A test set $S_e$ which is generated by the same procedure as for the training
set $D$, but independently of $D$.

• A test set $S_R$ of randomly generated elements of $F(X)$.

• A test set Sp of (pseudo-) randomly generated primitive elements in 
F(X). Recall that w G F(X) is primitive if and only if there exists a se- 
quence of Whitehead automorphisms t\ . . . ti G £l(X) such that a;' 1 ""*' = w 
for some x G X . Elements in Sp are generated by the procedure de- 
scribed in [3], which, roughly speaking, amounts to a random choice of 
x G X and a random choice of a sequence of automorphisms t\ . . . ti G 
Cl(X). 

• A test set Sio which is generated in a way similar to the procedure used to 
generate the training set D. The only difference is that the non-minimal 
elements are obtained by applying not one, but several randomly chosen 
Whitehead automorphisms. The number of such automorphisms is chosen 
uniformly randomly from the set {1, . . . , 10}, hence the name. 

Features. We use the feature vectors $f_1(w), \dots, f_6(w)$ described in Section 3.

Model. Our model is based on the linear regression classifier described in Section
4.3. For any word $w$ having feature vector $z(w)$ we compute the discriminant
function

$P(w) = b' z(w),$

where $b'$ is the vector of regression coefficients obtained from the training data set
$D$. The decision rule is based on the equal interval quantizing method described
earlier.
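
As an illustration of the discriminant function, the following sketch fits the coefficient vector $b$ by ordinary least squares and thresholds $P(w)$. Several points are our assumptions rather than part of the model above: labels are encoded as 1 for minimal and 0 for non-minimal, the coefficients are found by solving the normal equations directly, and a plain threshold at 0.5 replaces the equal interval quantizing rule. The function names are hypothetical.

```python
def fit_coefficients(Z, y):
    """Least-squares coefficients b for rows Z (feature vectors) and labels y,
    obtained by solving the normal equations (Z'Z) b = Z'y."""
    n = len(Z[0])
    # Augmented matrix [Z'Z | Z'y].
    A = [
        [sum(z[i] * z[j] for z in Z) for j in range(n)]
        + [sum(z[i] * t for z, t in zip(Z, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting; a singular Z'Z
    # (linearly dependent features) raises ZeroDivisionError.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [x - factor * p for x, p in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def discriminant(b, z):
    """P(w) = b'z(w) for a feature vector z = z(w)."""
    return sum(bi * zi for bi, zi in zip(b, z))


def is_minimal(b, z, threshold=0.5):
    """Assumed decision rule: classify as minimal when P(w) >= threshold."""
    return discriminant(b, z) >= threshold
```

Including a constant component 1 in every feature vector gives the regression an intercept term.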

Evaluation. Let $D_{eval}$ be a test data set. To evaluate the performance of the
given PR system we use a simple accuracy measure:

$A = \frac{|\{ w \in D_{eval} \mid \text{minimality of } w \text{ is decided correctly} \}|}{|D_{eval}|}.$

It is the fraction of correctly classified elements of the test set $D_{eval}$.
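
The accuracy measure can be computed as in the short sketch below, where the predicted and true labels are assumed to be given as two lists of equal length (the name `accuracy` and the label encoding are ours).

```python
def accuracy(predicted, actual):
    """Fraction of test elements whose minimality is decided correctly."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

Restricting both lists to words with $|w| > 4$ or $|w| > 100$ before the call gives accuracies of the kind reported for different length ranges.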

Results. Now we present the evaluation of the classifiers $P_f$ on the test dataset $S_e$,
where $f$ runs over the set of feature mappings $f_1, \dots, f_6$ mentioned above. By $A(f)$
we denote the accuracy of the classifier $P_f$. For simplicity we present results only
for the free group $F$ of rank 2 with basis $X = \{a, b\}$.

The results of evaluation of the classifiers $P_i = P_{f_i}$, $i = 1, \dots, 6$, on $S_e$ are given
in Table 1.





              A(f_1)   A(f_2)   A(f_3)   A(f_4)   A(f_5)   A(f_6)
|w| > 0       0.954    0.968    0.926    0.869    0.977    0.980
|w| > 4       0.957    0.969    0.927    0.870    0.977    0.981
|w| > 100     0.975    0.984    0.947    0.893    0.992    0.994

Table 1. Performance of the classifiers $P_1, \dots, P_6$ on the set $S_e$.



One can draw the following conclusions from the experiments:

• It seems that the accuracy of the classifiers increases when one adds new
edges to the graphs related to the feature mappings (though it is not clear
what the optimal set of features is);

• The classifier $P_6$ is the best so far, and it is remarkably reliable;

• Very short words are difficult to classify (perhaps because they do not
provide sufficient information for the classifiers);

• The estimated conditional probabilities for $P_6$ (which come from
Bayes' decision rule) are presented in Figure 1. Clearly, the classes of
minimal and non-minimal elements are separated around 0.5 with a small
overlap, so the regression works well with the threshold 0.5.




Figure 1. Conditional probabilities for $P_6$ (horizontal axis: regression value).



Looking for the best feature vectors. As we have seen above, the performance of
a classifier $P_f$ directly depends on the feature vector $f(w)$ built into it. Sometimes
it is possible to reduce the number of features in $f$ while maintaining the same level of
classification accuracy, and even to find more efficient combinations of the given
features. The corresponding procedure is called feature selection.

By far, the feature vector $f_6$ was the most effective. Observe that $f_6$ has 60
components (features). To find a better feature vector we used an iterative greedy
procedure to select the best vector from the set of all counting functions of the type

$\{ C(w, xvy) \mid x, y \in X^{\pm 1},\ v \in F(X),\ 1 < |v| < 3 \}.$
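
To make the notion of a counting function concrete, here is a small sketch. The definition of $C(w, u)$ is not repeated in this section, so the code assumes that it counts the occurrences of $u$ as a subword of the cyclic word $w$, and that a feature is this count normalized by $|w|$. Both points are our assumptions, as are the function names and the encoding of inverses by upper-case letters.

```python
def count_occurrences(w, u, cyclic=True):
    """Number of positions at which u occurs as a subword of w.
    With cyclic=True the word w is read around a circle."""
    if not u or len(u) > len(w):
        return 0
    # Wrap the first len(u) - 1 letters to catch occurrences across the seam.
    text = w + w[: len(u) - 1] if cyclic else w
    return sum(text.startswith(u, i) for i in range(len(text) - len(u) + 1))


def feature_vector(w, patterns):
    """Normalized counts C(w, u) / |w| for each pattern u (w must be non-empty)."""
    return [count_occurrences(w, u) / len(w) for u in patterns]
```

A greedy selection would then add, one at a time, the pattern whose counting function most improves the accuracy of the resulting classifier.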

It turns out that one of the most effective feature vectors consists of only two
counting functions:



\w\ 



2 



The results of comparison of $P^* = P_{f^*}$ with $P_1$ and $P_6$ are presented in Table 2.








              A(f_1)   A(f_6)   A(f*)
|w| > 0       0.954    0.980    0.987
|w| > 4       0.957    0.981    0.989
|w| > 100     0.975    0.994    0.993

Table 2. Comparative results for $P^*$.



5.2. Clustering. In this section we describe one application of the K-means 
clustering scheme to the Whitehead minimization problem. 

In general, cluster analysis is used to recover hidden structures in a sampled set
of objects. In Whitehead's minimization method the following Length Reduction
Problem is of prime interest: given a non-minimal word $w \in F$, find a (length-reducing)
Whitehead automorphism $t$ such that

$|wt| < |w|.$

Below we apply the K-means clustering method described in Section 4.6 to the
Length Reduction Problem. The task is to partition a given set of non-minimal
elements into clusters in such a way that every cluster has a Whitehead
automorphism assigned to it which reduces the length of most words in the
cluster. To illustrate the performance of the K-means algorithm we take the feature
vector function $f_2$ from Section 3 and the standard Euclidean metric in $\mathbb{R}^4$ (recall
that $f_2(w) \in \mathbb{R}^4$ for $w \in F_2$).

To perform the K-means procedure one needs to specify in advance the number
$K$ of expected clusters. Since we hope that every such cluster $C$ will correspond
to a particular Whitehead automorphism that reduces the length of elements in $C$,
the expected number of clusters can be calculated as follows. It is easy to see
that the set $\Omega_2$ of all Whitehead automorphisms of the free group $F_2(X)$ with basis
$X = \{a, b\}$ splits into two subsets: the set



N, 



■ ab 
b 



■ a 

bo 



of Nielsen automorphisms and the set of conjugations. If we view elements of $F_2$ as
cyclic words (i.e., up to a cyclic permutation), then the conjugations from $\Omega_2$ can
be ignored in the length reduction problem. Therefore, we would like the K-means
algorithm to find precisely 4 clusters.
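
For reference, a minimal sketch of the standard K-means iteration (Lloyd's algorithm) with the Euclidean metric is given below. It is a generic illustration and not necessarily the exact variant used in the experiments: the initial centers are passed in explicitly, an empty cluster keeps its previous center, and the function name and signature are ours.

```python
import math


def k_means(points, centers, max_iter=100):
    """Alternate assignment and update steps until the centers stop moving.
    Returns (centers, labels); labels[i] is the index of the center that
    points[i] was assigned to in the last assignment step."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest center in the Euclidean metric.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

In the setting of this section one would call it with four initial centers and feature vectors in $\mathbb{R}^4$.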

Let $S \subset F_2$ be a set of non-minimal cyclically reduced words from $F_2$. We
construct the set

$D = \{ f_2(w) \mid w \in S \}$

of feature vectors corresponding to words in $S$. To start the K-means algorithm one
needs to choose the set of initial centers $\mu_t^0$, $t \in N_2$. Observe that the algorithm is
quite sensitive to this choice of centers. There are various methods to generate
$\mu_t^0$, $t \in N_2$; here we describe just one of them. Let $S'$ be a sample subset of the set
$S$. For an automorphism $t \in N_2$ put



$C_t = \{ w \in S' \mid |wt| < |w|,\ \forall r \in N_2\ (r \neq t \rightarrow |wr| > |w|) \}$



and define

$\mu_t^0 = \frac{1}{|C_t|} \sum_{w \in C_t} f_2(w)$    (2)

as the initial estimates for the cluster centers (we assume here that the sets $C_t$ are
not empty).
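
The estimate (2) can be sketched as follows. To keep the example short, the length of the image $|wt|$ and the feature mapping are passed in as callables; the names `image_length` and `feature` and the dictionary representation of $N_2$ are hypothetical, and the condition on the other automorphisms uses the strict inequality as written in the definition of $C_t$.

```python
def initial_centers(sample, automorphisms, image_length, feature):
    """Initial centers following (2).
    sample        -- list of words (the sample subset S')
    automorphisms -- dict mapping a name to an automorphism t
    image_length  -- callable (t, w) -> |wt|
    feature       -- callable w -> feature vector, e.g. f_2(w)
    """
    centers = {}
    for name, t in automorphisms.items():
        # C_t: words shortened by t and lengthened by every other automorphism.
        c_t = [
            w for w in sample
            if image_length(t, w) < len(w)
            and all(
                image_length(r, w) > len(w)
                for other, r in automorphisms.items()
                if other != name
            )
        ]
        if not c_t:
            raise ValueError("the set C_t is empty for " + str(name))
        vectors = [feature(w) for w in c_t]
        # Component-wise mean of the feature vectors of the words in C_t.
        centers[name] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return centers
```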

The goodness of the clustering is evaluated using a measure $R_{max}$ defined below.
Let $C \subset D$ be a cluster. For $t \in N_2$ define

$R(t, C) = \frac{|\{ v \mid |t(v)| < |v|,\ v \in C \}|}{|C|}.$

The number $R(t, C)$ shows what fraction of the elements in $C$ are reducible by $t$. Now put

$R_{max}(C) = \max\{ R(t, C) \mid t \in N_2 \}.$

The number $R_{max}(C)$ shows what fraction of the elements in $C$ can be reduced by a single
automorphism.
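
In code, the two measures amount to the following sketch, where a cluster is given as a list of words and `image_length(t, v)` is a hypothetical callable returning the length of the image of $v$ under $t$.

```python
def reduction_rate(t, cluster, image_length):
    """R(t, C): fraction of the words in the cluster that t shortens."""
    return sum(image_length(t, v) < len(v) for v in cluster) / len(cluster)


def r_max(cluster, automorphisms, image_length):
    """R_max(C): the best reduction rate achieved by a single automorphism."""
    return max(reduction_rate(t, cluster, image_length) for t in automorphisms)
```

A value of `r_max` close to 1 means that nearly the whole cluster can be shortened by one and the same automorphism.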

Now we perform the 4-means clustering algorithm on the set $S$ and compute
$R(t, C_i)$, $t \in N_2$, for each obtained cluster $C_i$. The results of the experiments
are given in Table 3 for two different choices of the initial centers (the random
choice and the choice according to (2)). We can see from the table that the clustering
indeed groups elements in $S$ with respect to the length-reducing transformation.





                random $\mu_t^0$    $\mu_t^0$ estimated by (2)
avg(R_max)      0.930               1.000
max(R_max)      1.000               1.000
min(R_max)      0.743               1.000

Table 3. Evaluation of 4-means clustering of the set D.



The experiments show that it is possible to cluster non-minimal elements in
$F_2$, using the standard clustering algorithms, in such a way that:

• every cluster contains elements whose length can be (with a very high
probability) reduced by a particular Nielsen transformation;

• the transformation that reduces the length of most elements of one
cluster does not reduce the length of most elements of another cluster.

This gives a very strong heuristic for choosing a length-reducing automorphism for
a given word $w \in F_2$.

A simple decision rule which, for a given word $w \in F_2$, predicts a corresponding
length-reducing automorphism $t$ can be defined as follows. Let $\mu_t$ be the
centers of the clusters produced by the K-means clustering. Each $\mu_t$ corresponds
to a particular automorphism $t \in N_2$. For a given non-minimal cyclically reduced
word $w \in F_2$ we select an automorphism $t^* \in N_2$ such that

$\forall t \in N_2\ \left( \| f_2(w) - \mu_{t^*} \| \le \| f_2(w) - \mu_t \| \right)$

as the most probable length-reducing automorphism for $w$.
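
This nearest-center rule is short enough to state as code. In the sketch below the cluster centers are assumed to be stored in a dictionary keyed by a name for each automorphism; the representation and the function name are ours.

```python
import math


def predict_automorphism(feature_vector, centers):
    """Return the key t* whose center is closest to the feature vector
    in the Euclidean metric (ties go to the first such key)."""
    return min(centers, key=lambda t: math.dist(feature_vector, centers[t]))
```

The predicted automorphism can then be tried first, falling back to the remaining ones only if it fails to reduce the length.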

In conclusion, we would like to add that a similar analysis can be used to
predict the most probable length-reducing automorphisms for words in free groups of
rank $n$ larger than 2. However, the number of the corresponding clusters grows
exponentially with $n$, which increases the error rate of the classification. In this case
a more careful clustering could still be applied, where the clusters correspond to some
particular groups of Whitehead automorphisms.